Curated list of information retrieval and web search resources from all
      around the web.

      ## Introduction
      Information Retrieval involves finding relevant information for user
      queries, ranging from the simple domain of database search to the
      complicated aspects of web search (e.g., Google, Bing, Yahoo). Currently,
      researchers are developing algorithms to address the Information Need of
      users by maximizing the User and Topic Relevance of retrieved results
      while minimizing Information Overload and retrieval time.

      ## Contributing
      Please feel free to send me pull requests or
      [email](mailto:harshal.priyadarshi@utexas.edu) me to add new links. I
      am very open to suggestions and corrections. Please look at the
      contributions guide.
    
    
    Books
    
      - 
        Introduction to Information Retrieval
        - C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008. (A good
        first book for getting started with Information Retrieval).
      
 
      - 
        Search Engines: Information Retrieval in Practice
        - Bruce Croft, Don Metzler, and Trevor Strohman. 2009. (Great book for
        readers interested in knowing how Search Engines work. The book is very
        detailed).
      
 
      - 
        Modern Information Retrieval
        - R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999.
      
 
      - 
        Information Retrieval in Practice
        - B. Croft, D. Metzler, T. Strohman. Pearson Education, 2009.
      
 
      - 
        Mining the Web: Analysis of Hypertext and Semi Structured Data
        - S. Chakrabarti. Morgan Kaufmann, 2002.
      
 
      - 
        Language Modeling for Information Retrieval
        - W.B. Croft, J. Lafferty. Springer, 2003. (Covers the language modeling
        aspect of Information Retrieval. It also details the probabilistic
        perspective on this domain extensively).
      
 
      - 
        Information Retrieval: A Survey
        - Ed Greengrass, 2000. (Comprehensive survey of conventional Information
        Retrieval from before the deep learning era).
      
 
      - 
        Introduction to Modern Information Retrieval
        - G.G. Chowdhury. Neal-Schuman, 2003. (Intended for students of library
        and information studies).
      
 
      - 
        Text Information Retrieval Systems
        - C.T. Meadow, B.R. Boyce, D.H. Kraft, C.L. Barry. Academic Press, 2007.
        (Library/information science perspective).
      
 
    
    Courses
    
    Software
    
      - 
        Apache Lucene - Open source
        search engine library that can be used to test information retrieval
        algorithms. Twitter uses it as the core of its real-time search.
      
 
      - 
        The Lemur Project - The Lemur
        Project develops search engines, browser toolbars, text analysis tools,
        and data resources that support research and development of information
        retrieval and text mining software.
        
          - 
            Indri Search Engine
            - Another open source search engine, a competitor to Apache Lucene.
          
 
          - 
            Lemur Toolkit -
            Open Source Toolkit for research in Language Modeling, filtering and
            categorization.
          
 
        
       
    
    Datasets
    Standard IR Collections
    
      - 
        DBPedia - Linked
        data web.
      
 
      - 
        Cranfield Collections
        - This is one of the first collections in the IR domain. The dataset is
        too small for any statistical significance analysis, but it is
        nevertheless suitable for pilot runs.
      
 
      - 
        TREC Collections - TREC is
        the benchmark dataset used by most IR and Web search algorithms. It has
        several tracks, each of which consists of a dataset for testing a
        specific task. The tracks, along with their suggested use cases, are:
        
          - 
            Blog - Explore
            information seeking behavior in the blogosphere.
          
 
          - 
            Chemical IR -
            Address challenges in building large chemical testbeds for chemical
            IR.
          
 
          - 
            Clinical Decision Support
            - Investigate techniques to link medical cases to information
            relevant for patient care.
          
 
          - 
            Confusion -
            Study the Known Item Searching problem.
          
 
          - 
            Contextual Suggestion
            - Investigate search techniques for complex information needs
            (context and user interests based).
          
 
          - 
            Crowdsourcing -
            Explore crowdsourcing methods for performing and evaluating search.
          
 
          - 
            Enterprise -
            Study search over an organization's data.
          
 
          - 
            Entity - Perform
            entity-related search (find entities and their properties) on Web
            data.
          
 
          - 
            Filtering -
            Make a binary decision on whether to retrieve each new incoming
            document, given a stable information need.
          
 
          - 
            Federated Web Search
            - Study how well results from various search services can be merged.
          
 
          - 
            Genomics -
            Study retrieval efficiency of genomics data and corresponding
            documentation.
          
 
          - 
            HARD - Obtain High
            Accuracy Retrieval from Documents by leveraging the searcher’s
            context.
          
 
          - 
            Interactive Track
            - Study user interaction with text retrieval systems.
          
 
          - 
            Knowledge base acceleration
            - Study algorithms that improve the efficiency of human knowledge
            base curators.
          
 
          - 
            Legal Track -
            Study retrieval systems that have high recall for the legal
            document use case.
          
 
          - 
            Medical Track -
            Explore unstructured search performance over patient record data.
          
 
          - 
            Microblog Track
            - Examine how well real-time information needs are satisfied on
            microblogging sites.
          
 
          - 
            Million Query Track
            - Explore ad-hoc retrieval over a large set of queries.
          
 
          - 
            Novelty Track -
            Investigate systems’ abilities to locate new (non-redundant)
            information.
          
 
          - 
            Question Answering Track
            - Test systems that scale beyond document retrieval, to retrieve
            answers to factoid, list and definition type questions.
          
 
          - 
            Relevance Feedback Track
            - For deep evaluation of relevance feedback processes.
          
 
          - 
            Robust Track -
            Study the effectiveness of individual topics.
          
 
          - 
            Session Track -
            Develop methods for measuring multiple-query sessions where
            information needs drift.
          
 
          - 
            SPAM Track -
            Benchmark spam filtering approaches.
          
 
          - 
            Tasks Track -
            Test whether systems can infer the possible tasks users might be
            trying to accomplish with a query.
          
 
          - 
            Temporal Summarization Track
            - Develop systems that allow users to efficiently monitor the
            information associated with an event over time.
          
 
          - 
            Terabyte Track
            - Test the scalability of IR systems to large-scale collections.
          
 
          - 
            Web Track -
            Explore information seeking behaviors common in general web search.
          
 
        
       
      - 
        GOV2 Test Collection
        - This is one of the largest Web collections of documents, obtained from
        a crawl of government websites by Charlie Clarke and Ian Soboroff using
        NIST hardware and network, then formatted by Nick Craswell.
      
 
      - 
        NTCIR Test Collection
        - This is a collection of a wide variety of datasets, ranging from
        ad-hoc collections, Chinese IR collections, and mobile clickthrough
        collections to medical collections. The focus is mostly on East Asian
        languages and cross-language information retrieval.
        
      
 
      - 
        Conference and Labs of the Evaluation Forum (CLEF) dataset
        - It contains a multi-lingual document collection. The test suite
        includes:
        
          - AdHoc - News Test suite.
 
          - 
            Domain Specific Test Suite - On collections of scientific articles.
          
 
          - Question Answering Test Suite.
 
        
       
      - 
        Reuters Corpora
        - The corpora are now available through NIST. They include the
        following:
        
          - 
            RCV1 (Reuters Corpus Volume 1) - Consists of only English-language
            news stories.
          
 
          - 
            RCV2 (Reuters Corpus Volume 2) - Consists of stories in 13
            languages (Dutch, French, German, Chinese, Japanese, Russian,
            Portuguese, Spanish, Latin American Spanish, Italian, Danish,
            Norwegian, and Swedish). Note that the stories are not parallel.
          
 
          - 
            TRC (Thomson Reuters Text Research Collection) - This is a fairly
            recent corpus consisting of 1,800,370 news stories covering the
            period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14.
          
 
        
       
      - 
        20 Newsgroup dataset
        - This data set consists of 20,000 newsgroup messages (posts) taken from
        20 newsgroup topics.
      
 
      - 
        English Gigaword Fifth Edition
        - This data set is a comprehensive archive of English newswire text data
        including headlines, datelines and articles.
      
 
      - 
        Document Understanding Conference (DUC) datasets
        - Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon
        request.
      
 
    
    External Curation Links
    
    Talks
    Technical Talks
    
    Philosophical Talks
    
    Conferences
    
      - 
        Web Search and Data Mining Conference -
        WSDM.
      
 
      - 
        Special Interest Group on Information Retrieval -
        SIGIR.
      
 
      - 
        Text REtrieval Conference - TREC.
      
 
      - 
        European Conference on Information Retrieval -
        ECIR.
      
 
      - 
        World Wide Web Conference - WWW.
      
 
      - 
        Conference on Information and Knowledge Management -
        CIKM.
      
 
      - 
        Forum for Information Retrieval Evaluation -
        FIRE.
      
 
      - 
        Conference and Labs of the Evaluation Forum -
        CLEF.
      
 
      - 
        NII Testbeds and Community for Information access Research -
        NTCIR.
      
 
    
    Blogs
    
    Interesting Reads
    
    License
    
      
    
    
      To the extent possible under law,
      Harshal Priyadarshi and
      all the contributors have waived all copyright and related or neighboring
      rights to this work.